ABSTRACT

In this paper, dependent regression models are proposed for the analysis of symbolic data. The proposed linear regression models are estimated from interval-valued data, for which either the lower and upper bounds or the center and range values are available. The least squares method is used to estimate the models. A real data set is used to illustrate the usefulness of the proposed regression models for handling interval-valued data. The estimation results are evaluated using the predicted mean squared errors, and they support the proposed dependent regression models.

1. INTRODUCTION

In multivariate analysis, very large data sets lead to computational difficulties for the standard forms of analysis. Summarizing the large set of individuals into a smaller number of groups has therefore attracted many statistical researchers. Symbolic data analysis (SDA) extends the analytical tools available for standard data. SDA is regarded as a domain related to multivariate analysis, pattern recognition, data mining, machine learning, and artificial intelligence (see Billard and Diday, 2006 [1]; Bock and Diday, 2000 [2]; and Diday and Fraiture-Noirhomme, 2008 [3]).

One form of symbolic data is the interval-valued form, in which a large set of individuals is summarized into smaller groups or classes. Interval data analysis tools have been used by many authors (see De Carvalho, 1995 [7]; Ichino et al., 1996 [11]; Cazes et al., 1997 [6]; Bertrand and Goupil, 2000 [1]; Laura and Palumbo, 2000 [12]; Laura et al., 2000 [13]; Palumbo and Verde, 2000 [16]; Rasson and Lissoir, 2000 [17]; Billard and Diday, 2003 [2]; Gorenen et al., 2006 [10]; Billard et al., 2007 [4]; and Maia et al., 2008 [14]).

Linear regression for interval-valued data was introduced by Billard and Diday (2000). Their approach fits a linear regression model to the midpoints of the interval values in the learning set, and then applies the fitted model to the lower and upper bounds of the interval values of the predictor variables in order to predict the lower and upper bounds of the response. Their work was later improved by Lima Neto and De Carvalho (2008).

This paper proposes two linear regression models to be fitted using symbolic interval-valued data. Both models are based on the center and range of the interval values.

The paper is organized as follows. Section 2 introduces basics of symbolic data. Section 3 considers the linear 
regression models for classical data. Section 4 describes the proposed linear regression models for symbolic interval data. 
Section 5 presents application studies. Section 6 considers conclusions and recommendations for future studies. 

2. SYMBOLIC DATA 


Symbolic data arise from summarizing large data sets in such a way that the resulting data are of a manageable size. The summarized data can be of different types, such as a single quantitative or categorical value, a set of values or categories, an interval, or a set of values with associated probabilities or weights (see Bock and Diday, 2000).

With symbolic data, rather than taking a single value x, an observed value may be multi-valued, such as {2, 7, 12} or {small, medium, large}; interval-valued, such as [5, 10]; or modal, such as {1 with probability 0.7, 0 with probability 0.3}; and so on (see Bock and Diday, 2000).

Let $E = \{1, 2, \ldots, n\}$ denote the set of units that are described by $p$ symbolic interval-valued variables $X_1, X_2, \ldots, X_p$. For each element $k \in E$, the interval $X(k)$ is denoted by $[\underline{x}_k, \overline{x}_k]$, where $\underline{x}_k$ and $\overline{x}_k$ are the lower and upper bounds of the interval $X(k) \subset \mathbb{R}$, respectively. The values within each interval $X(k)$ are taken to be uniformly distributed, with mean and variance as in the following definitions (see Bock and Diday, 2000).

Definition (1) 

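Let $X$ be an interval-valued variable defined on the set $E = \{1, 2, \ldots, n\}$. The empirical distribution function of $X$, denoted by $F_X$, is the distribution function of a mixture of $n$ uniform distributions defined on the intervals $X(k)$ for $k \in E$.

The empirical density function of $X$, which is denoted by $f_X$, is defined as:

$$ f_X(\xi) = \frac{1}{n} \sum_{k \in E} \frac{I_{X(k)}(\xi)}{\overline{x}_k - \underline{x}_k}, \tag{1} $$

where $\underline{x}_k$ and $\overline{x}_k$ are the lower and upper bounds of $X(k)$, and $I_{X(k)}(\xi)$ equals 1 if $\xi \in X(k)$ and 0 otherwise.

The empirical density function $f_X$ corresponds to the frequency distribution for a multi-valued variable.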

Definition (2) 

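Let $X$ be an interval-valued variable defined on the set $E = \{1, 2, \ldots, n\}$, and let $f_X$ be the empirical density function of $X$ defined in (1). Then the empirical mean of $X$, $\bar{X}$, and the empirical standard deviation, $S_X$, are defined respectively as follows:

$$ \bar{X} = \int_{-\infty}^{+\infty} x f_X(x)\,dx = \frac{1}{2n} \sum_{k \in E} \left(\overline{x}_k + \underline{x}_k\right), \tag{2} $$

$$ S_X^2 = \int_{-\infty}^{+\infty} (x - \bar{X})^2 f_X(x)\,dx = \frac{1}{3n} \sum_{k \in E} \left(\underline{x}_k^2 + \underline{x}_k \overline{x}_k + \overline{x}_k^2\right) - \frac{1}{4n^2} \left[\sum_{k \in E} \left(\underline{x}_k + \overline{x}_k\right)\right]^2. \tag{3} $$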
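
As a minimal sketch of Definition (2), the Python snippet below computes the empirical mean and variance of an interval-valued variable from its lower and upper bounds. It relies on two facts about a uniform distribution on [a, b]: its mean is (a + b)/2 and its second moment is (a² + ab + b²)/3. The function name `interval_mean_var` and the use of NumPy are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def interval_mean_var(lower, upper):
    """Empirical mean and variance of an interval-valued variable.

    Hypothetical helper: each interval [lower_k, upper_k] is treated as a
    uniform distribution, and the n intervals are mixed with equal weights.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    n = lower.size
    # Mean of the mixture: average of the interval midpoints.
    mean = (lower + upper).sum() / (2.0 * n)
    # Second moment of the mixture: average of (a^2 + a*b + b^2) / 3.
    second_moment = (lower**2 + lower * upper + upper**2).sum() / (3.0 * n)
    return mean, second_moment - mean**2

# Two adjacent intervals [0, 2] and [2, 4] behave like one uniform on [0, 4].
mean, var = interval_mean_var([0.0, 2.0], [2.0, 4.0])
print(mean, var)
```

The standard deviation $S_X$ is the square root of the returned variance.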

3. LINEAR REGRESSION MODELS FOR CLASSICAL DATA

Linear regression is a widely used method for studying the relationship between one response and one or more predictors. Classical regression analysis is used when the observations are single numerical values (see Draper and Smith, 1981; and Montgomery, 1982).

Consider the following standard multiple linear regression model,

$$ Y = X^{T}\beta + \varepsilon, \tag{4} $$

where $Y$ is a response variable, $X^{T} = (1, X_1, \ldots, X_p)$ is a predictor vector, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^{T}$ is a parameter vector to be estimated, and $\varepsilon \sim (0, \sigma^2)$ is the error term. In this model, each $X_j$ takes a specific value for all $j = 0, 1, 2, \ldots, p$, with $X_0 = 1$. The objective of model (4) is to find the best linear relationship between the two variables $Y$ and $X$. The estimation problem in model (4) consists in finding a good parameter estimator $\hat{\beta} \in \mathbb{R}^{p+1}$ such that $\hat{Y} = X^{T}\hat{\beta}$ holds and the residual $(Y_i - X_i^{T}\hat{\beta})$ exists for all $i \in \{1, 2, \ldots, n\}$.
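
To make the classical model (4) concrete, the sketch below estimates $\beta$ by ordinary least squares with NumPy. The helper name `fit_ols` is hypothetical, and the toy data are invented for illustration only.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares for model (4); hypothetical helper.

    X holds one row per observation and one column per predictor.
    A column of ones is prepended so that beta[0] is the intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(X.shape[0]), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

# Invented data that follow y = 1 + 2x exactly.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1.0 + 2.0 * x[:, 0]
print(fit_ols(x, y))
```

`numpy.linalg.lstsq` is used here instead of forming an explicit matrix inverse, since it also copes with an ill-conditioned design matrix.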

Recently, different approaches have been considered for the regression analysis of symbolic interval-valued data (see Tanaka and Lee, 1998; Billard and Diday, 2000; Lima Neto and De Carvalho, 2008, 2010, and 2011; and Sun and Li, 2015).

4. THE PROPOSED LINEAR REGRESSION MODELS FOR SYMBOLIC INTERVAL DATA 

4-1. The Methodology 

Consider the linear regression model in (4), now described by $(p+1)$ symbolic interval-valued variables $Y, X_1, X_2, \ldots, X_p$. Let $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$, where $x_{ij} = [a_{ij}, b_{ij}] \in \{[a, b] : a, b \in \mathbb{R},\ a \le b\}$ for all $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$, and let $Y_i = [Y_{Li}, Y_{Ui}]$, be the observed values of $X_j$ and $Y$, respectively.

There are two different methods of fitting the linear regression model in (4) that will be considered in this section, 
the center (CM) and the center and range (CRM) fitting methods. 

4-2. Regression Model Using the Center Method (CM) 

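The center method was proposed by Billard and Diday (2000). They showed that the two interval-valued variables $Y$ and $X$ are related according to the following relationship:

$$ Y^c = X^c\beta + \varepsilon^c, \tag{5} $$

where $Y^c = (Y_1^c, \ldots, Y_n^c)^{T}$, $X^c = ((x_1^c)^{T}, \ldots, (x_n^c)^{T})^{T}$, $(x_i^c)^{T} = (1, x_{i1}^c, \ldots, x_{ip}^c)$ for $i = 1, \ldots, n$, $\beta = (\beta_0, \ldots, \beta_p)^{T}$, $\varepsilon^c = (\varepsilon_1^c, \ldots, \varepsilon_n^c)^{T}$, $x_{ij}^c = (a_{ij} + b_{ij})/2$, and $Y_i^c = (Y_{Li} + Y_{Ui})/2$.

If the matrix $X^c$ has full column rank, then the least squares estimator (LSE) of $\beta$ is as follows:

$$ \hat{\beta} = \left[(X^c)^{T}X^c\right]^{-1}(X^c)^{T}Y^c. \tag{6} $$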
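
The lower and upper bounds of the predicted value of $Y$, $\hat{Y} = [\hat{Y}_L, \hat{Y}_U]$, are defined as follows:

$$ \hat{Y}_L = (x_L)^{T}\hat{\beta} \quad \text{and} \quad \hat{Y}_U = (x_U)^{T}\hat{\beta}, \tag{7} $$

where $(x_L)^{T} = (1, a_1, a_2, \ldots, a_p)$ and $(x_U)^{T} = (1, b_1, b_2, \ldots, b_p)$.

The coefficient of determination of the center method (CM), $R^2_{CM}$, is computed as follows:

$$ R^2_{CM} = \frac{\sum_{i=1}^{n} \left(\hat{Y}_i^c - \bar{Y}^c\right)^2}{\sum_{i=1}^{n} \left(Y_i^c - \bar{Y}^c\right)^2}, \tag{8} $$

where

$$ \hat{Y}_i^c = (\hat{Y}_{Li} + \hat{Y}_{Ui})/2 \quad \text{and} \quad Y_i^c = (Y_{Li} + Y_{Ui})/2, \tag{9} $$

and $\bar{Y}^c$ is the mean of the observed centers $Y_i^c$.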
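
The center method can be sketched in a few lines of Python: fit on the interval midpoints, then apply the same coefficients to the lower and upper predictor bounds. The function names `fit_cm`, `predict_cm`, and `r2_cm` are hypothetical, the data are invented, and the R² ratio used is the usual explained-to-total sum of squares on the centers, which is an assumption about the intended form of the coefficient of determination.

```python
import numpy as np

def _design(M):
    """Prepend an intercept column of ones to a predictor matrix."""
    M = np.asarray(M, dtype=float)
    return np.column_stack([np.ones(M.shape[0]), M])

def fit_cm(x_lower, x_upper, y_lower, y_upper):
    """Least squares fit on interval midpoints (center method)."""
    x_center = (np.asarray(x_lower, float) + np.asarray(x_upper, float)) / 2.0
    y_center = (np.asarray(y_lower, float) + np.asarray(y_upper, float)) / 2.0
    beta, *_ = np.linalg.lstsq(_design(x_center), y_center, rcond=None)
    return beta

def predict_cm(beta, x_lower, x_upper):
    """Apply the fitted coefficients to the lower and upper predictor bounds."""
    return _design(x_lower) @ beta, _design(x_upper) @ beta

def r2_cm(beta, x_lower, x_upper, y_lower, y_upper):
    """Explained over total sum of squares, computed on the centers."""
    pred_lower, pred_upper = predict_cm(beta, x_lower, x_upper)
    fitted_center = (pred_lower + pred_upper) / 2.0
    y_center = (np.asarray(y_lower, float) + np.asarray(y_upper, float)) / 2.0
    y_bar = y_center.mean()
    return ((fitted_center - y_bar) ** 2).sum() / ((y_center - y_bar) ** 2).sum()

# Invented intervals whose bounds follow y = 1 + 2x exactly.
x_lo = np.array([[0.0], [1.0], [2.0], [4.0]])
x_hi = x_lo + 2.0
y_lo = 1.0 + 2.0 * x_lo[:, 0]
y_hi = 1.0 + 2.0 * x_hi[:, 0]
beta = fit_cm(x_lo, x_hi, y_lo, y_hi)
print(beta, r2_cm(beta, x_lo, x_hi, y_lo, y_hi))
```

Note that when a slope is negative, applying the coefficients to the lower bounds can give a predicted lower bound above the predicted upper bound; nothing in this sketch guards against that.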

4-3. Regression Models Using the Center and Range Method (CRM) 

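Lima Neto and de Carvalho (2008) proposed the center and range method as an approach for fitting the linear regression model in (4). The idea of this approach is to use the information contained in both the centers and the ranges of the interval-valued variables. They showed that the two interval-valued variables $Y$ and $X$ are related, over the centers, according to the following relationship:

$$ Y^c = X^c\beta^c + \varepsilon^c, \tag{10} $$

where $Y^c = (Y_1^c, \ldots, Y_n^c)^{T}$, $X^c = ((x_1^c)^{T}, \ldots, (x_n^c)^{T})^{T}$, $(x_i^c)^{T} = (1, x_{i1}^c, \ldots, x_{ip}^c)$ for $i = 1, \ldots, n$, $\beta^c = (\beta_0^c, \ldots, \beta_p^c)^{T}$, $\varepsilon^c = (\varepsilon_1^c, \ldots, \varepsilon_n^c)^{T}$, $x_{ij}^c = (a_{ij} + b_{ij})/2$, and $Y_i^c = (Y_{Li} + Y_{Ui})/2$.

Also, Lima Neto and de Carvalho (2008) showed that the two interval-valued variables $Y$ and $X$ are related, over the ranges, according to the following relationship:

$$ Y^r = X^r\beta^r + \varepsilon^r, \tag{11} $$

where $Y^r = (Y_1^r, \ldots, Y_n^r)^{T}$, $X^r = ((x_1^r)^{T}, \ldots, (x_n^r)^{T})^{T}$, $(x_i^r)^{T} = (1, x_{i1}^r, \ldots, x_{ip}^r)$ for $i = 1, \ldots, n$, $\beta^r = (\beta_0^r, \ldots, \beta_p^r)^{T}$, $\varepsilon^r = (\varepsilon_1^r, \ldots, \varepsilon_n^r)^{T}$, $x_{ij}^r = (b_{ij} - a_{ij})/2$ for $j = 1, \ldots, p$, and $Y_i^r = (Y_{Ui} - Y_{Li})/2$ for $i = 1, \ldots, n$.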


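If both matrices $X^c$ and $X^r$ have full column rank, then the least squares estimators (LSEs) of $\beta^c$ and $\beta^r$ in equations (10) and (11), respectively, are as follows:

$$ \hat{\beta}^c = \left[(X^c)^{T}X^c\right]^{-1}(X^c)^{T}Y^c, \tag{12} $$

and

$$ \hat{\beta}^r = \left[(X^r)^{T}X^r\right]^{-1}(X^r)^{T}Y^r. \tag{13} $$

The lower and upper bounds of the predicted value of $Y$, $\hat{Y} = [\hat{Y}_L, \hat{Y}_U]$, are defined as follows:

$$ \hat{Y}_L = \hat{Y}^c - \hat{Y}^r \quad \text{and} \quad \hat{Y}_U = \hat{Y}^c + \hat{Y}^r, \tag{14} $$

where $\hat{Y}^c = (x^c)^{T}\hat{\beta}^c$, $\hat{Y}^r = (x^r)^{T}\hat{\beta}^r$, $(x^c)^{T} = (1, x_1^c, \ldots, x_p^c)$, $(x^r)^{T} = (1, x_1^r, \ldots, x_p^r)$, $\hat{\beta}^c = (\hat{\beta}_0^c, \ldots, \hat{\beta}_p^c)^{T}$, and $\hat{\beta}^r = (\hat{\beta}_0^r, \ldots, \hat{\beta}_p^r)^{T}$.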

Lima Neto and de Carvalho (2010) introduced the constrained center and range method (CCRM) to ensure that $\hat{Y}_{Li} \le \hat{Y}_{Ui}$, as follows:

$$ Y^c = X^c\beta^c + \varepsilon^c, \qquad Y^r = X^r\beta^r + \varepsilon^r, \tag{15} $$

with constraints $\beta_j^r \ge 0$, $j = 0, 1, \ldots, p$.

The least squares estimators of $\beta^c$ and $\beta^r$ in (15) are estimated as in (12) and (13), respectively.

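The coefficient of determination of the center and range method, $R^2_{CRM}$, can be derived as in the case of CM as follows:

$$ R^2_{CRM(c)} = \frac{\sum_{i=1}^{n} \left(\hat{Y}_i^c - \bar{Y}^c\right)^2}{\sum_{i=1}^{n} \left(Y_i^c - \bar{Y}^c\right)^2}, \tag{16} $$

for the center, and

$$ R^2_{CRM(r)} = \frac{\sum_{i=1}^{n} \left(\hat{Y}_i^r - \bar{Y}^r\right)^2}{\sum_{i=1}^{n} \left(Y_i^r - \bar{Y}^r\right)^2}, \tag{17} $$

for the range, where

$$ Y_i^c = (Y_{Li} + Y_{Ui})/2, \qquad \hat{Y}_i^c = (\hat{Y}_{Li} + \hat{Y}_{Ui})/2, \tag{18} $$

$$ Y_i^r = (Y_{Ui} - Y_{Li})/2, \qquad \hat{Y}_i^r = (\hat{Y}_{Ui} - \hat{Y}_{Li})/2, \tag{19} $$

and $\bar{Y}^c$ and $\bar{Y}^r$ are the means of the observed centers and ranges, respectively.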

Lima Neto and de Carvalho (2010) suggested different measures of the goodness of fit for both CRM and CCRM, as follows:

$$ R_1^2 = \min\left(R_c^2, R_r^2\right), \quad \text{or} \quad R_2^2 = \frac{R_c^2 + R_r^2}{2}, \quad \text{or} \quad R_3^2 = \max\left(R_c^2, R_r^2\right), \tag{20} $$

where $R_c^2$ and $R_r^2$ are the coefficients of determination of the center and range models. They showed that $R_1^2$ and $R_3^2$ are a pessimistic and an optimistic version of the goodness-of-fit measure, respectively. It is shown that $R_2^2$ lies between $R_1^2$ and $R_3^2$.
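
As a small illustration of the three summaries just described, the snippet below combines a center R² and a range R² into a pessimistic (minimum), an average, and an optimistic (maximum) value. The function name `combined_r2`, the dictionary keys, and the input values are hypothetical.

```python
def combined_r2(r2_center, r2_range):
    """Pessimistic, average, and optimistic summaries of two R^2 values."""
    return {
        "min": min(r2_center, r2_range),
        "mean": (r2_center + r2_range) / 2.0,
        "max": max(r2_center, r2_range),
    }

# Invented values, used only to show the ordering min <= mean <= max.
summary = combined_r2(0.8, 0.4)
print(summary)
```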


4-4. The Proposed Dependent Regression Models Using the Center and Range Method (DCRM) 

The center and range method is based on the independence of the two models, the center model defined in (10) and the range model defined in (11). That is, the predictor interval-valued variables of the center and range models are treated as independent, as follows:

The center model: $Y^c = X^c\beta^c + \varepsilon^c$,

and

The range model: $Y^r = X^r\beta^r + \varepsilon^r$.

This work proposes the case where the predictors of the center, $X^c$, and of the range, $X^r$, are dependent according to the following relation:

$$ X^r = X^c\alpha + \varepsilon^r, $$

and then the center and range models are defined as follows:

The center model:

$$ Y^c = X^c\beta^c + \varepsilon^c, \tag{19} $$

and the range model:

$$ Y^r = (X^c\alpha)\beta^r + \varepsilon^r = X^c\alpha\beta^r + \varepsilon^r, \quad \text{so that} \quad Y^r = X^c\beta^{*} + \varepsilon^r, \tag{20} $$

where $\beta^{*} = \alpha\beta^r$. The least squares estimator of $\beta^c$ in (19) is as defined in (12), and the least squares estimator of $\beta^{*}$ in (20) is defined as:

$$ \hat{\beta}^{*} = \left[(X^c)^{T}X^c\right]^{-1}(X^c)^{T}Y^r. \tag{21} $$

The derivation of (21) is presented in Appendix (A).
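
A minimal sketch of the proposed dependent model, assuming the formulation above: both the response centers and the response half-ranges are regressed on the same design matrix built from the predictor centers. The function names `fit_dcrm` and `predict_dcrm` are hypothetical and the data are invented.

```python
import numpy as np

def _design(M):
    """Prepend an intercept column of ones to a predictor matrix."""
    M = np.asarray(M, dtype=float)
    return np.column_stack([np.ones(M.shape[0]), M])

def fit_dcrm(x_lower, x_upper, y_lower, y_upper):
    """Regress response centers and half-ranges on the predictor centers."""
    x_lower = np.asarray(x_lower, dtype=float)
    x_upper = np.asarray(x_upper, dtype=float)
    y_lower = np.asarray(y_lower, dtype=float)
    y_upper = np.asarray(y_upper, dtype=float)
    Xc = _design((x_lower + x_upper) / 2.0)
    y_c = (y_lower + y_upper) / 2.0
    y_r = (y_upper - y_lower) / 2.0
    beta_c, *_ = np.linalg.lstsq(Xc, y_c, rcond=None)     # center model
    beta_star, *_ = np.linalg.lstsq(Xc, y_r, rcond=None)  # dependent range model
    return beta_c, beta_star

def predict_dcrm(beta_c, beta_star, x_lower, x_upper):
    """Predicted bounds from the predictor centers only."""
    x_lower = np.asarray(x_lower, dtype=float)
    x_upper = np.asarray(x_upper, dtype=float)
    Xc = _design((x_lower + x_upper) / 2.0)
    y_c_hat = Xc @ beta_c
    y_r_hat = Xc @ beta_star
    return y_c_hat - y_r_hat, y_c_hat + y_r_hat

# Invented data: centers follow 1 + 2*x_c, half-ranges follow 0.5 + 0.1*x_c.
x_lo = np.array([[0.0], [1.0], [2.0], [4.0]])
x_hi = np.array([[2.0], [5.0], [4.0], [10.0]])
x_c = (x_lo + x_hi)[:, 0] / 2.0
y_c = 1.0 + 2.0 * x_c
y_r = 0.5 + 0.1 * x_c
beta_c, beta_star = fit_dcrm(x_lo, x_hi, y_c - y_r, y_c + y_r)
print(beta_c, beta_star)
```

Since the two fits share one design matrix, the matrix $(X^c)^{T}X^c$ only needs to be formed once in practice.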

5. APPLICATIONS

The proposed dependent center and range linear regression models are estimated and compared with the independent models using real interval-valued data. The data are from cardiology and are given in terms of the center and range values of the intervals.

The cardiological interval data contain records of the pulse rate (Y), systolic blood pressure (X1), and diastolic blood pressure (X2) taken from 11 patients, as shown in Table (1).


Table 1: The Cardiological Interval-Valued Data

No   PulseC   SystC   DiastC   PulseR   SystR   DiastR
 1     56       95      60       24       10      20
 2     66      110      80       12       40      20
 3     73      160      95       34       40      10
 4     91      126      94       42       32      28
 5     63       95      60       18       10      20
 6     85      145      95       30        3      30
 7     69       80     145       12       40      10
 8     86      145      83       28       30      14
 9     87      150      90       22       80      40
10     91      159     100       10       42      20
11     93      130      89       14       40      22


The least squares estimators and the estimated standard deviations of the response center and range, $(\hat{\sigma}(Y^c), \hat{\sigma}(Y^r))$, for the different estimation methods CM, CRM, CCRM, DCRM, and DCCRM are given in Table (2). The estimated standard deviations of the response center, $\hat{\sigma}(Y^c)$, are equal for all center models, but the estimated standard deviations of the response range, $\hat{\sigma}(Y^r)$, differ across the models. Also, $\hat{\sigma}(Y^r)$ for DCRM is less than for CRM, and $\hat{\sigma}(Y^r)$ for DCCRM is less than for CCRM (see Figure (1)). This means that the linear relation between the center and range of interval-valued variables should be considered when handling linear regression models for interval-valued variables.
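
The comparison described above can be sketched as follows, using the center and range values as transcribed from Table 1. The residual standard deviation is computed here as the square root of RSS/(n − k), which is an assumption about how the reported standard deviations were obtained, so the printed values are not guaranteed to reproduce Table 2. The names `design` and `fit_with_sigma` are hypothetical.

```python
import numpy as np

# Columns: PulseC, SystC, DiastC, PulseR, SystR, DiastR, as transcribed
# from Table 1; any transcription error in the table carries over.
DATA = np.array([
    [56,  95,  60, 24, 10, 20],
    [66, 110,  80, 12, 40, 20],
    [73, 160,  95, 34, 40, 10],
    [91, 126,  94, 42, 32, 28],
    [63,  95,  60, 18, 10, 20],
    [85, 145,  95, 30,  3, 30],
    [69,  80, 145, 12, 40, 10],
    [86, 145,  83, 28, 30, 14],
    [87, 150,  90, 22, 80, 40],
    [91, 159, 100, 10, 42, 20],
    [93, 130,  89, 14, 40, 22],
], dtype=float)

def design(M):
    """Prepend an intercept column of ones."""
    return np.column_stack([np.ones(M.shape[0]), M])

def fit_with_sigma(X, y):
    """Least squares fit and residual standard deviation sqrt(RSS / (n - k))."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma = np.sqrt(resid @ resid / (X.shape[0] - X.shape[1]))
    return beta, sigma

y_c, x_c = DATA[:, 0], DATA[:, 1:3]   # response and predictor centers
y_r, x_r = DATA[:, 3], DATA[:, 4:6]   # response and predictor ranges

beta_c, sigma_c = fit_with_sigma(design(x_c), y_c)          # center model
beta_r, sigma_r_crm = fit_with_sigma(design(x_r), y_r)      # independent range model
beta_star, sigma_r_dcrm = fit_with_sigma(design(x_c), y_r)  # dependent range model

print("center:", beta_c, sigma_c)
print("range, independent:", beta_r, sigma_r_crm)
print("range, dependent:", beta_star, sigma_r_dcrm)
```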

6. CONCLUSIONS AND RECOMMENDATIONS 

The main objective of this paper is to fit the linear regression model using interval-valued variables. The method of least squares is used for fitting the independent and dependent regression models. The estimation results support the dependent regression models. This means that, for interval-valued variables, the relation between the center and the range of the data is a real feature of this type of data. Therefore, this paper recommends using the dependency relation between the center and the range when handling interval-valued data. In the future, this dependency relation will be considered and studied for collinear, influential, and outlying interval-valued data.

Table 2: Least Squares Estimation Results for the Different Models (CM, CRM, CCRM), and the
Proposed Models (DCRM, DCCRM) for the Cardiological Interval-Valued Data

Method    Intercept    Systolic    Diastolic    Estimated standard deviation
          (beta_0)     (beta_1)    (beta_2)     (response center or range)
CM        21.171       0.32889     0.16985       9.517
CRM: C    21.171       0.32889     0.16985       9.517
CRM: R    20.215      -0.1467      0.34801      11.054


www.iaset.us 


editor@iaset. us 




















Magda M. M. Haggag 




Figure 1: The Estimated Standard Deviations of the Predicted Response for the 
Models CRM, CCRM, DCRM, and DCCRM 

APPENDIX (A)

Derivation of the Least Squares Method for Dependent Regression Models for DCRM and DCCRM: 

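The center regression model is $Y^c = X^c\beta^c + \varepsilon^c$, and the proposed range regression model is $Y^r = X^c\beta^{*} + \varepsilon^r$, as considered in (19) and (20).

Consider the following two center and range regression models:

$$ Y_i^c = \beta_0^c + \beta_1^c x_{i1}^c + \cdots + \beta_p^c x_{ip}^c + \varepsilon_i^c, $$

$$ Y_i^r = \beta_0^{*} + \beta_1^{*} x_{i1}^c + \cdots + \beta_p^{*} x_{ip}^c + \varepsilon_i^r. $$

The sum of squares of the deviations is given by:

$$ SS_{DCRM} = \sum_{i=1}^{n} (\varepsilon_i^c)^2 + \sum_{i=1}^{n} (\varepsilon_i^r)^2 = \sum_{i=1}^{n} \left(Y_i^c - \beta_0^c - \beta_1^c x_{i1}^c - \cdots - \beta_p^c x_{ip}^c\right)^2 + \sum_{i=1}^{n} \left(Y_i^r - \beta_0^{*} - \beta_1^{*} x_{i1}^c - \cdots - \beta_p^{*} x_{ip}^c\right)^2. \tag{A-1} $$

Differentiating (A-1) with respect to the parameters $\beta_0^c, \beta_1^c, \ldots, \beta_p^c$, the following normal equations are obtained:

$$ n\hat{\beta}_0^c + \hat{\beta}_1^c \sum_{i=1}^{n} x_{i1}^c + \cdots + \hat{\beta}_p^c \sum_{i=1}^{n} x_{ip}^c = \sum_{i=1}^{n} Y_i^c, $$

$$ \hat{\beta}_0^c \sum_{i=1}^{n} x_{i1}^c + \hat{\beta}_1^c \sum_{i=1}^{n} (x_{i1}^c)^2 + \cdots + \hat{\beta}_p^c \sum_{i=1}^{n} x_{i1}^c x_{ip}^c = \sum_{i=1}^{n} Y_i^c x_{i1}^c, $$

$$ \vdots $$

$$ \hat{\beta}_0^c \sum_{i=1}^{n} x_{ip}^c + \hat{\beta}_1^c \sum_{i=1}^{n} x_{ip}^c x_{i1}^c + \cdots + \hat{\beta}_p^c \sum_{i=1}^{n} (x_{ip}^c)^2 = \sum_{i=1}^{n} Y_i^c x_{ip}^c. $$

Then the least squares estimators of $\beta_0^c, \beta_1^c, \ldots, \beta_p^c$, which minimize (A-1), are obtained by solving the above $(p+1)$ normal equations, as follows:

$$ \hat{\beta}^c = (\hat{\beta}_0^c, \hat{\beta}_1^c, \ldots, \hat{\beta}_p^c)^{T} = \left[(X^c)^{T}X^c\right]^{-1}(X^c)^{T}Y^c, $$

where the matrix $(X^c)^{T}X^c$ is a $(p+1)\times(p+1)$ matrix of full rank defined as:

$$ (X^c)^{T}X^c = \begin{pmatrix} n & \sum_i x_{i1}^c & \cdots & \sum_i x_{ip}^c \\ \sum_i x_{i1}^c & \sum_i (x_{i1}^c)^2 & \cdots & \sum_i x_{i1}^c x_{ip}^c \\ \vdots & \vdots & \ddots & \vdots \\ \sum_i x_{ip}^c & \sum_i x_{ip}^c x_{i1}^c & \cdots & \sum_i (x_{ip}^c)^2 \end{pmatrix}, \tag{A-2} $$

and

$$ (X^c)^{T}Y^c = \left(\sum_i Y_i^c,\ \sum_i Y_i^c x_{i1}^c,\ \ldots,\ \sum_i Y_i^c x_{ip}^c\right)^{T}. $$

Also, by differentiating (A-1) with respect to the parameters $\beta_0^{*}, \beta_1^{*}, \ldots, \beta_p^{*}$, the following normal equations are obtained:

$$ n\hat{\beta}_0^{*} + \hat{\beta}_1^{*} \sum_{i=1}^{n} x_{i1}^c + \cdots + \hat{\beta}_p^{*} \sum_{i=1}^{n} x_{ip}^c = \sum_{i=1}^{n} Y_i^r, $$

$$ \hat{\beta}_0^{*} \sum_{i=1}^{n} x_{i1}^c + \hat{\beta}_1^{*} \sum_{i=1}^{n} (x_{i1}^c)^2 + \cdots + \hat{\beta}_p^{*} \sum_{i=1}^{n} x_{i1}^c x_{ip}^c = \sum_{i=1}^{n} Y_i^r x_{i1}^c, $$

$$ \vdots $$

$$ \hat{\beta}_0^{*} \sum_{i=1}^{n} x_{ip}^c + \hat{\beta}_1^{*} \sum_{i=1}^{n} x_{ip}^c x_{i1}^c + \cdots + \hat{\beta}_p^{*} \sum_{i=1}^{n} (x_{ip}^c)^2 = \sum_{i=1}^{n} Y_i^r x_{ip}^c. $$

Then the least squares estimators of $\beta_0^{*}, \beta_1^{*}, \ldots, \beta_p^{*}$, which minimize (A-1), are obtained by solving the above $(p+1)$ normal equations, as follows:

$$ \hat{\beta}^{*} = (\hat{\beta}_0^{*}, \hat{\beta}_1^{*}, \ldots, \hat{\beta}_p^{*})^{T} = \left[(X^c)^{T}X^c\right]^{-1}(X^c)^{T}Y^r, $$

where the matrix $(X^c)^{T}X^c$ is a $(p+1)\times(p+1)$ matrix of full rank defined as in (A-2), and

$$ (X^c)^{T}Y^r = \left(\sum_i Y_i^r,\ \sum_i Y_i^r x_{i1}^c,\ \ldots,\ \sum_i Y_i^r x_{ip}^c\right)^{T}. $$

The estimated center and range responses are defined, respectively, as:

$$ \hat{Y}^c = (x^c)^{T}\hat{\beta}^c \quad \text{and} \quad \hat{Y}^r = (x^c)^{T}\hat{\beta}^{*}. $$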
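
As a numerical check on the derivation, the sketch below solves the two sets of normal equations directly, using the shared matrix $(X^c)^{T}X^c$, and compares the result with NumPy's least squares routine on simulated data. The function name `normal_equation_estimates` and the simulated data are illustrative assumptions.

```python
import numpy as np

def normal_equation_estimates(Xc, y_center, y_range):
    """Solve the two (p+1) x (p+1) normal-equation systems from Appendix (A).

    Xc must already contain the leading column of ones.
    """
    gram = Xc.T @ Xc                                  # the matrix in (A-2)
    beta_c = np.linalg.solve(gram, Xc.T @ y_center)   # center estimator
    beta_star = np.linalg.solve(gram, Xc.T @ y_range) # dependent range estimator
    return beta_c, beta_star

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
Xc = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])
y_center = rng.normal(size=8)
y_range = rng.normal(size=8)

beta_c, beta_star = normal_equation_estimates(Xc, y_center, y_range)
ref_c, *_ = np.linalg.lstsq(Xc, y_center, rcond=None)
ref_star, *_ = np.linalg.lstsq(Xc, y_range, rcond=None)
print(np.allclose(beta_c, ref_c), np.allclose(beta_star, ref_star))
```

Because (A-1) is a sum of two terms with disjoint parameter sets, minimizing it jointly gives the same estimators as fitting the center and range models separately on the same design matrix.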